ABSTRACT


Indonesia is ranked 8th among all countries in the world as a source of
global spam. A web-based spam filter service exposed through a REST API
can be used to detect Indonesian-language email spam on
the email server or in various types of email applications. With a REST
API, applications exchange JSON data using the existing HTTP methods.
One commonly used type of spam filter is Bayesian filtering, in which
the Naive Bayes algorithm serves as the classification algorithm.
In this study, the N-gram method is used to increase the accuracy of
the Naive Bayes implementation. The N-gram method and the Naive Bayes
algorithm for detecting Indonesian-language spam email were successfully
implemented, with accuracy ranging from 0.615 to 0.94, precision from
0.566 to 0.924, recall from 0.96 to 1.00, and F-measure from 0.721 to
0.942. The best result was obtained with the 5-gram method, with an
accuracy of 0.94, a precision of 0.924, a recall of 0.96, and an
F-measure of 0.942.
1. INTRODUCTION

Spam is unsolicited email that is sent to a large number of recipients [1]. According to Suryanto [2],
the number of spam emails in the world increases exponentially every year. Based on recent spam statistics
from AV-test, Indonesia is ranked 8th among all countries in the world as a source of global spam [3].
Spam distribution has not been explicitly regulated in Indonesia's Electronic Information and
Transactions Act (Law No. 11 of 2008/UU ITE). However, sending spam can be categorized among
the prohibited acts in Chapter VII, Articles 27-34, more precisely Article 33 [4].

Various studies on spam detection and filtering have been carried out, as seen in the works
of Nagwani and Sharaff [5], Sah and Parmar [6], Bhuiyan et al. [7], Ezpeleta et al. [8], and Jawale et al. [9].
The most widely used method to prevent spam, however, is text mining with Bayesian filtering. Even
though many advanced text mining techniques have been developed [10-15] and different methods have
been compared [16-18], the Naive Bayes algorithm remains attractive because it is simple and fast
to compute [19, 20]. Moreover, the N-gram method can be used to improve the accuracy of the Naive Bayes
algorithm inside the spam classifier, as shown in [21-23].

A web service is defined as an interface that describes a set of operations that can be accessed
through the network [24]. Exposing the filter as a web service allows it to be used by mail servers and
mail clients on various platform types [25]. The most used protocol to access an API is REST [26].
The main advantage of REST is that it uses less bandwidth than SOAP, because SOAP requires an XML
wrapper for every request and response [25]. Text mining can be used to handle problems of classification,
clustering, and information extraction and retrieval [27]. Text mining and data mining differ in the data
source used. Data mining works on structured data, while text mining works on unstructured data
in text form [4]. The initial stage in text mining is text pre-processing, i.e., the process of
converting unstructured data into structured data as needed so that further mining processes can be
carried out. In general, the pre-processing steps are case-folding, tokenizing,
filtering, and stemming [27]. Case-folding changes all the characters in the document into
lowercase [28]. Tokenizing cuts the input text into words, terms, symbols, punctuation,
or other meaningful elements called tokens [29]. Filtering picks the essential words
from the resulting tokens [30]. Stemming maps the morphological variants of a word to its base or
general form [31].
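
The four pre-processing steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this study: the stop-word list is a hypothetical, deliberately tiny sample, the tokenizer keeps only alphanumeric runs (the definition above also allows symbols and punctuation as tokens), and stemming is left as an identity function that a caller could replace with an Indonesian stemmer.

```python
import re

# Hypothetical, deliberately tiny stop-word list (assumption, not from the study).
STOP_WORDS = {"dan", "yang", "di", "ke", "dari"}


def preprocess(text, stop_words=STOP_WORDS, stem=lambda word: word):
    """Apply case-folding, tokenizing, filtering, and stemming in order."""
    folded = text.lower()                               # case-folding
    tokens = re.findall(r"[a-z0-9]+", folded)           # tokenizing (alphanumeric runs only)
    kept = [t for t in tokens if t not in stop_words]   # filtering by stop-word removal
    return [stem(t) for t in kept]                      # stemming (identity by default)


print(preprocess("Anda MENANG hadiah dan uang dari kami!"))
```

Passing the stemmer as a parameter keeps the sketch independent of any particular stemming library.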

In this research, we detect spam email in the Indonesian language using a text mining
method, namely the Naive Bayes classifier, enhanced with the N-gram method. Unlike
previous research, this study proposes and implements both the Naive Bayes and N-gram methods as a web
service using a REST API design. Further information on those methods and the REST API design is given in
the following section. Section 3 presents the implementation results and an analysis of the spam detection
results in terms of accuracy, precision, recall, and F-measure. Finally, some concluding remarks are given
in Section 4.


2. RESEARCH METHOD 
2.1. N-gram 

An N-gram is a piece of n characters taken from a string [32]. The N-gram method is used to take
consecutive pieces of n letters from a word, moving from the beginning to the end of the string.
If n=1 it is a unigram, if n=2 it is a bigram, and if n=3 it is a trigram. For example, the word "bagus" can
be split into the following N-grams:
- Unigram : b, a, g, u, s
- Bigram : _b, ba, ag, gu, us, s_
- Trigram : _ba, bag, agu, gus, us_, s__

The "_" character represents the space at the beginning and at the end of the word.
The advantage of using N-grams follows from their nature as parts of a string: an error in one part of
the string only changes some of the N-grams [33]. Another representation
and use of this advantage can be seen in the publication of Tayyeh and Al-Jumaili [34].
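
A minimal Python sketch of character N-gram extraction is given below. The padding rule is an assumption inferred from the "bagus" example (no padding for unigrams, otherwise one "_" in front of the word and n-1 after it); the function name `char_ngrams` is illustrative only.

```python
def char_ngrams(word, n, pad="_"):
    """Return the character n-grams of a word.

    Assumed padding rule: none for n == 1, otherwise one pad character
    before the word and n - 1 pad characters after it.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    padded = word if n == 1 else pad + word + pad * (n - 1)
    # Slide a window of width n over the padded string.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


for n in (1, 2, 3):
    print(n, char_ngrams("bagus", n))
```

Under a different padding convention only the first and last few N-grams would change, which is the robustness property described above.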


2.2. Naïve Bayes 

The Naive Bayes algorithm is based on the theorem of the English scientist Thomas Bayes. It uses
probability and statistics to predict future probabilities based on past experience [35].
It requires only a small amount of training data [36], and the Naive Bayes classifier used in
the program is based on the Bayes formula in (1) [37].

Equation (1) gives the probability of A occurring given B, computed from the probability of B given A,
the probability of A, and the probability of B. The Naive Bayes classifier, also referred to here as
multinomial Naive Bayes, is a simplified model of the Bayes algorithm that suits the classification of
text or documents; it is given by (2) [35].


$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \qquad (1)$$

$$V_{MAP} = \arg\max_{V_j \in V} \prod_{i=1}^{n} P(x_i \mid V_j)\, P(V_j) \qquad (2)$$

where:
$V_{MAP}$ = category or class that has the highest posterior
$V_j$ = category or class, j = 1, 2, 3, ..., n
$x_i$ = words, i = 1, 2, 3, ..., n
$P(x_i \mid V_j)$ = probability of $x_i$ in category $V_j$
$P(V_j)$ = probability of $V_j$
$\arg\max$ = element of the domain that gives the greatest value
$V_j \in V$ = $V_j$ is an element of the set $V$

$P(V_j)$ and $P(x_i \mid V_j)$ are calculated by (3) and (4) as follows [35].

$$P(V_j) = \frac{|docs_j|}{|sample|} \qquad (3)$$

$$P(x_i \mid V_j) = \frac{n_k + 1}{n + |words|} \qquad (4)$$

where:
$|docs_j|$ = total number of documents in category j
$|sample|$ = total number of documents
$n_k$ = number of occurrences of each word
$n$ = number of word occurrences in each category
$|words|$ = total number of words from all categories
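
The classifier defined by (2), (3), and (4) can be sketched as follows. This is a minimal illustration rather than the code used in this study, and it rests on two assumptions: |words| is taken as the number of distinct tokens seen in training, and the product in (2) is evaluated as a sum of logarithms to avoid numerical underflow. The function names and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Collect the counts needed by (3) and (4) from (tokens, label) pairs."""
    doc_counts = Counter()               # |docs_j| per category
    token_counts = defaultdict(Counter)  # n_k per category and token
    vocab = set()                        # assumed meaning of |words|
    for tokens, label in docs:
        doc_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return doc_counts, token_counts, vocab


def classify(tokens, model):
    """Return the category with the highest posterior, as in (2)."""
    doc_counts, token_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        n = sum(token_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)      # (3)
        for token in tokens:
            n_k = token_counts[label][token]
            score += math.log((n_k + 1) / (n + len(vocab)))   # (4)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data for illustration only (not from the study's dataset).
model = train([
    (["gratis", "hadiah", "klik"], "spam"),
    (["hadiah", "gratis", "menang"], "spam"),
    (["rapat", "besok", "pagi"], "ham"),
    (["laporan", "rapat", "proyek"], "ham"),
])
print(classify(["gratis", "hadiah"], model))
print(classify(["rapat", "laporan"], model))
```

The tokens passed to `train` and `classify` can be words or character N-grams; the formulas are the same in both cases.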


2.3. REST API 

The term REST, which stands for representational state transfer, was first used by Roy Thomas
Fielding, one of the pioneers of the Apache web server project, in his doctoral dissertation at the University
of California in 2000 [38]. The components of the REST API architecture are client applications, networks,
and web services. The client application sends an HTTP request with a method such as GET, PUT, or POST
to the web service over the network. The design of the application programming interface (API) consists of
the method design and the JSON design. The method design specifies the URI request patterns and the HTTP
method used for each request. The JSON design specifies the JSON data that is sent to the client.

Figure 1 shows the API flow of the spam filter created in this system. The client application can send
seven functions to the web service through an API gateway: generate user key,
remove user key, user list, add dataset, list dataset, remove dataset, and check spam. The purpose of
each function is explained in the method design. The client application sends each request as an HTTP
request using the specified method and must authenticate when doing so.
Authentication is carried in the "X-API-KEY" header, filled with the key
owned by each client application. After the client application sends the request, the web service responds
to the client application in JSON. The client application then handles the response
according to its needs.
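
As an illustration of the authenticated request described above, the following Python sketch builds an HTTP request that carries the "X-API-KEY" header. Only the header name and the listUser URL (which appears in the console screenshot of Figure 3) come from this paper; the helper names, the placeholder key, and the choice to send body parameters as JSON are assumptions.

```python
import json
import urllib.request


def build_request(base_url, function, api_key, body=None, method="GET"):
    """Build an HTTP request for one API function, authenticated by X-API-KEY."""
    headers = {"X-API-KEY": api_key}
    data = None
    if body is not None:
        # Assumption: body parameters are sent as JSON.
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        f"{base_url}/{function}", data=data, headers=headers, method=method
    )


def send(request):
    """Send the request and decode the JSON response (requires a running service)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request("http://localhost/SpamFilter", "listUser", "your-api-key")
print(request.get_method(), request.full_url)
```

Calling `send(request)` would perform the actual exchange; it is left out of the example run because it needs a reachable web service.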


[Diagram: the client application sends authenticated requests (for example Generate Key User, Remove Key User, and Remove Dataset) through the API gateway to the web service]

Figure 1. API flow of spam filter
3. RESULTS AND DISCUSSION

Based on the design, there are seven main application interface designs, namely the content page,
visitor menu, user menu, sign-in menu, register menu, console menu, and admin page. Figure 2 shows
the implementation of the content page design. In accordance with the design, there are a header, a sidebar,
a footer, and a content column consisting of four components: an explanation column, an input form field,
a try button, and a result field. The content page design is used for the case-folding, tokenizing,
filtering, stemming, and N-gram menus. Figure 2 is the content page of the case-folding menu. In the
case-folding menu, there is a form for entering a sentence, which is used to simulate the result of
the case-folding process. When a visitor or user presses the try button, the system performs case-folding
and displays the results.


[Screenshot: content page with a sidebar menu (Case-folding, Tokenizing, Filtering, Stemming, N-gram, Naive Bayes, Spam Filter), a short description of each step with a "View details" link, and a Home, Signup, Signin footer]

Figure 2. The content page


Figure 3 is an implementation of the console menu design. The console menu contains a form, a send
button, and a result column. The form consists of input fields for the HTTP request, the API key,
and the body parameters. When the user presses the send button, the system runs the HTTP
request entered by the user, and the result field displays the JSON result of the request.


[Screenshot: console menu showing the request GET http://localhost/SpamFilter/listUser with an X-API-KEY header, and an answer containing a status field and a user array with id, email, and key entries]

Figure 3. Implementation of menu console design
Figure 4 shows the implementation of the admin page design. In accordance with the design,
the admin page has a header that contains the application name and a sign-out menu. The content
column on the admin page displays user information. For each user, there are suspend, active,
and upgrade buttons. If the suspend button is pressed, the user cannot send HTTP requests.
If the active button is pressed, the user can send HTTP requests. If the upgrade button is pressed,
the user can send HTTP requests without any limit.

The spam filter was evaluated with several measures, namely accuracy, precision, recall,
and f-measure. The values obtained range from 0 to 1, where a higher value means a better
result. Spam filter testing uses 100 spam documents and 100 ham
documents, with training data of 200 spam documents and 200 ham documents. The URI
dataset contains all data used as training and test data in this study [39]. The total amount of training and
test data in this study is 600 documents, obtained from 30 people who each contributed ten spam emails
and ten ham emails. The data from the first 20 people were used as training data (400 documents), while
the data from the last ten people were used as test data (200 documents).
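
The four measures can be computed from the counts of true positives, false positives, false negatives, and true negatives, with spam as the positive class. The sketch below uses the standard definitions; the counts in the example are hypothetical values for a 100 spam / 100 ham test split and are not results from this study.

```python
def evaluate(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and f-measure (spam = positive class).

    Assumes at least one predicted positive and one actual positive,
    so that no denominator is zero.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts: 90 of 100 spam detected, 10 of 100 ham misclassified.
scores = evaluate(tp=90, fp=10, fn=10, tn=90)
print({name: round(value, 3) for name, value in scores.items()})
```

A filter that labels most emails as spam reaches a recall close to 1 while its precision stays low, which is why the four measures are reported together.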


[Screenshot: admin page listing users in a table with columns ID, Username, Application, Key, and Upgrade]

Figure 4. Implementation of admin page design


Table 1 shows the test results of the spam filter for each N-gram method. The lowest accuracy,
precision, and f-measure values belong to the spam filter using the 1-gram method, with an accuracy
of 0.615, a precision of 0.566, and an f-measure of 0.721. The lowest recall value, 0.96, belongs to
the spam filters using the 5-gram to 10-gram methods. The highest accuracy, precision, and f-measure
values belong to the spam filter using the 5-gram method, with an accuracy
of 0.94, a precision of 0.924, and an f-measure of 0.942. Meanwhile, the highest recall
value, 1, belongs to the spam filter using the 2-gram method.


Table 1. Results of spam filter 


N-gram  Accuracy  Recall  Precision  F-measure
0       0.935     0.97    0.907      0.938
1       0.615     0.99    0.566      0.721
2       0.64      1       0.582      0.736
3       0.695     0.99    0.623      0.765
4       0.89      0.97    0.837      0.899
5       0.94      0.96    0.924      0.942
6       0.935     0.96    0.915      0.937
7       0.935     0.96    0.915      0.937
8       0.935     0.96    0.915      0.937
9       0.935     0.96    0.915      0.937
10      0.935     0.96    0.915      0.937


Figure 5 shows a graph of the spam filter test results. The graph indicates that from the 6-gram
method onwards there are no significant changes in the performance of the Naive Bayes
implementation: the accuracy, precision, recall, and f-measure values are stable from the 6-gram
method to the 10-gram method, with a precision of 0.915, a recall of 0.96, an accuracy of 0.935,
and an f-measure of 0.937.


[Line chart: accuracy, recall, precision, and F-measure for the 1-gram to 10-gram methods]

Figure 5. Spam filter test graph


4. CONCLUSION
In this study, the N-gram method and the Naive Bayes algorithm were successfully implemented to
detect Indonesian-language spam using a REST API architecture. From the experimental results, it can be
concluded that the accuracy values ranged from 0.615 to 0.94, the precision values from 0.566 to
0.924, the recall values from 0.96 to 1, and the f-measure values from 0.721 to 0.942.
From the 6-gram method onwards, the results did not change significantly. The best N-gram method,
giving the highest accuracy, precision, and f-measure values in detecting Indonesian-language spam, is
the 5-gram method combined with the Naive Bayes algorithm.